[BUG] Fix source dataset issue when running link jobs #1193
Quick Summary
It looks as if there was a bug in `vertically_concatenate.py` that was causing two `source_dataset` columns to be produced in various steps where `source_dataset` was outlined in the settings object. This was causing issues in DuckDB, where I couldn't use a vertically concatenated dataframe with a `source_dataset` column for a link job -- i.e. a single df with `source_dataset` which outlines which dataset a record belongs to.

But, more troubling than this, it appears that this was just outright breaking link-only jobs in Spark. I'll post some code that breaks below.
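
To give a sense of the shape of the job described above, here is a hedged sketch (not the actual breaking code from this PR). It assumes the Splink 3 DuckDB API (`DuckDBLinker`, `exact_match`), and the data, comparison and blocking rule are placeholders:

```python
import pandas as pd
from splink.duckdb.duckdb_linker import DuckDBLinker
from splink.duckdb.duckdb_comparison_library import exact_match

# A single, already vertically concatenated dataframe whose source_dataset
# column says which dataset each record came from.
df = pd.DataFrame(
    {
        "unique_id": [1, 2, 3, 4],
        "source_dataset": ["df_left", "df_left", "df_right", "df_right"],
        "first_name": ["amy", "amy", "bob", "bob"],
    }
)

settings = {
    "link_type": "link_only",
    "source_dataset_column_name": "source_dataset",
    "comparisons": [exact_match("first_name")],
    "blocking_rules_to_generate_predictions": ["l.first_name = r.first_name"],
}

linker = DuckDBLinker(df, settings)

# Before this fix, the vertical concatenation produced a duplicate
# source_dataset column, so the job fell over here (and the equivalent
# Spark link-only job broke outright).
df_predict = linker.predict()
```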
Essentially though, it was causing this behaviour in `concat_with_tf`, which was causing Spark to throw a wobbly. This code fixes the issue by migrating `_source_dataset_column_name` to the linker class and checking whether the column already exists within the user's database.

Internals and why I opted to go down this path:
To start - these changes don't adjust the underlying SQL/logic that's being used wherever I've replaced `linker._settings_obj_source_dataset_column_name` with `linker._source_dataset_column_name`. The workflow is still: `concat_with_tf`. The change is in the naming convention used and what we output in `predict()`.
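
For illustration only, here is a hypothetical sketch of the kind of linker-level property this describes -- not the PR's actual code, and the class and attribute names are made up. It checks whether the user's input already contains the settings-defined source dataset column and falls back to a prefixed alias if so:

```python
class LinkerSketch:
    # Stand-in for the linker class, purely to illustrate the property.

    def __init__(self, input_columns, settings_source_dataset_col="source_dataset"):
        self._input_columns = list(input_columns)
        self._settings_source_dataset_col = settings_source_dataset_col

    @property
    def _source_dataset_column_name(self):
        # If the user's data already contains the settings-defined column,
        # use an internal alias so the concat step never duplicates it.
        if self._settings_source_dataset_col in self._input_columns:
            return "__splink_source_dataset"
        return self._settings_source_dataset_col
```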
Now if the user provides a dataframe that already contains `source_dataset`, splink will adjust step 1 to use the alias `__splink_source_dataset`. This will then be scrapped when it's no longer needed and the user will be left with their original `source_dataset` in the output. If the user provides a df without `source_dataset`, splink won't fall over and will just call step 1 `source_dataset`.
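
To make that aliasing concrete, here is a small self-contained sketch (not Splink's actual SQL generation; the function and table names are invented) of the behaviour described above:

```python
USER_PROVIDED = "source_dataset"
INTERNAL_ALIAS = "__splink_source_dataset"


def concat_select_sql(input_columns, dataset_name):
    """Build the SELECT for one input table in the vertical concatenation."""
    # Only fall back to the alias when the user's data already has its own
    # source_dataset column; otherwise the plain name is used.
    concat_col = INTERNAL_ALIAS if USER_PROVIDED in input_columns else USER_PROVIDED
    cols = ", ".join(input_columns)
    return f"select '{dataset_name}' as {concat_col}, {cols} from {dataset_name}"


# User-supplied source_dataset column: the internal column is aliased, and the
# alias is dropped again later so the user's original column survives.
print(concat_select_sql(["unique_id", "first_name", "source_dataset"], "df_left"))

# No user-supplied column: the concat step just calls it source_dataset.
print(concat_select_sql(["unique_id", "first_name"], "df_right"))
```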
Why this way?

In the `source_dataset` case -- where the user's data already has a `source_dataset` column -- the link job will still run.

On point 3 - we may want to check the source dataset column in preprocessing and establish if it's valid.
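
If we do add that preprocessing check, it could be something as simple as the sketch below -- purely illustrative, with a made-up function name and messages:

```python
def validate_source_dataset_column(column_names, values, col="source_dataset"):
    """Fail early if a user-supplied source dataset column is unusable."""
    if col not in column_names:
        raise ValueError(
            f"Expected a '{col}' column identifying which dataset each record belongs to."
        )
    if any(v is None for v in values):
        raise ValueError(f"'{col}' contains nulls.")
    if len(set(values)) < 2:
        raise ValueError(f"'{col}' must contain at least two distinct values for a link job.")
```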